ETC5521 Diving Deeper into Data Exploration: Assignment 1

As per Monash’s integrity rules, these solutions are not to be shared beyond this class.

Author

Prof. Di Cook

Published

July 28, 2025

🎯 Goal

The assignment is designed to assess whether your knowledge of data wrangling and GitHub is at a level that will allow you to successfully follow the content of this class. The assignment represents 15% of your final grade for ETC5521. This is an individual assignment.

📌 Guidelines

  1. Accept the GitHub Classroom Assignment provided in Moodle using a GitHub Classroom compatible web browser. This should generate a private GitHub repository that can be found at https://github.com/etc5521-2025. Your GitHub assignment 1 repo should contain the file assign01.html, README.md, assign01-submission.qmd, assignment.css, etc5521-assignment1.Rproj and .gitignore.

  2. Answer each question in the assign01-submission.qmd in the repo.

  3. For the final submission, render assign01-submission.qmd, which will contain your answers. Make sure to provide the link to the transcript of any Generative AI conversations you used in arriving at your solution. Note that marks are allocated for the overall grammar and structure of your final report.

  4. Leave all of your files in your GitHub repo for marking. We will check your git commit history. You should have contributions to the repo with consistent commits over time. (Note: nothing needs to be submitted to Moodle.)

  5. You are expected to develop your solutions by yourself, without discussing any details with other class members or other friends or contacts. You can ask for clarifications from the teaching team, and we encourage you to attend consultations to get assistance as needed. As a Monash student you are expected to adhere to Monash’s academic integrity policy, and to the details on the use of Generative AI given in this unit’s Moodle assessment overview. Failure to adhere to this policy may result in a ZERO for this assignment, followed by an academic integrity breach report. The chief examiner reserves the right to question you about any part of your solution.

  6. The primary sources for the methods needed for this assignment are R for Data Science (2e) and [Text Mining with R: A Tidy Approach](https://www.tidytextmining.com).

  7. We expect that this assignment will take about 10 hours to complete. You should work on the analysis steadily over the period between the release and due dates. Spend a couple of hours soon after the assignment is released getting started, and several hours in each of the following two weeks refining your analysis. Your GitHub commit history should reflect this working pattern.

Deadlines:

Due date Turn in Points
11:45pm Mon Aug 4 Assignment 1 Repo on GitHub has been created by due date, and commit history shows steady effort on analysis. 3
11:45pm Mon Aug 18 Final solutions available on repo. 12

Marks

Part Points
GitHub Repo 3
Q1-Q4 each worth 3
References and AI -3
Formatting, spelling, grammar, code and reproducibility -3

Appropriate use of GitHub is an important collaborative analysis skill, and demonstrating this counts towards marks.

Spelling and grammar mistakes and a lack of tidy formatting detract from the score because they make the report harder to mark and harder to read. It is expected that we can reproduce your report, so all code needs to be included, and the code needs to be readable.

Using GAI well is an emerging skill. Inadequate use, or over-use without fully processing the responses, detracts from a data analysis.

🛠️ Exercises

The data to use is available in the orcas R package on GitHub. The package can be installed using

remotes::install_github("jadeynryan/orcas")

to access the data. The descriptions of the variables can be found here.

Why are we interested in orca encounters? Whale watching is a major tourism business in many parts of the world. Monitoring the population of whales is important for the sustainability of many businesses and for the health of the planet. Beyond this, orcas are cool! They are intelligent, social, beautiful and one of the top predators in the ocean.

An interesting fact is that orcas routinely helped whalers hunt baleen whales in Twofold Bay near Eden, Australia, in the 1800s. They would herd the whales into the bay for the whalers. When a whale was killed, the whalers rewarded the orcas with the whale’s tongue and lips. Admittedly, this is quite gruesome 😳 😬.

library(orcas)
library(tidyverse)
glimpse(cwr_tidy)
Rows: 775
Columns: 19
$ year               <dbl> 2024, 2024, 2024, 2024, 2024, 2…
$ date               <date> 2024-10-06, 2024-09-16, 2024-0…
$ encounter_sequence <chr> "1", "1", "21", "2", "1", "2", …
$ encounter_number   <dbl> 100, 96, 95, 94, 93, 92, NA, 90…
$ begin_time         <time> 10:48:00, 09:30:00, 02:15:00, …
$ end_time           <time> 12:21:00, 10:11:00, 05:00:00, …
$ duration           <chr> "5580s (~1.55 hours)", "2460s (…
$ vessel             <chr> "Mike 1", "Orcinus", "KCB III, …
$ observers          <chr> "Mark Malleson,Joe Zelwietro", …
$ pods_or_ecotype    <chr> "K, L", "L", "L", "K, L", "J", …
$ ids_encountered    <chr> NA, "L90, L128", "L90 and L128"…
$ location           <chr> "Swiftsure Bank", "Haro Strait"…
$ begin_latitude     <dbl> 49, 48, 49, 49, 49, 49, 48, 49,…
$ begin_longitude    <dbl> -125, -123, -123, -123, -123, -…
$ end_latitude       <dbl> 49, 48, 49, 49, 49, 49, 48, 49,…
$ end_longitude      <dbl> -125, -123, -123, -123, -123, -…
$ encounter_summary  <chr> "After Mark’s encounter with me…
$ nmfs_permit        <chr> "27038/DFO SARA 388", "27038/DF…
$ link               <chr> "https://www.whaleresearch.com/…

Question 1

Summarise the temporal patterns in this data. For example,

  • what is the time frame of the data collection?
  • are there any seasonal patterns in the measurements?
  • what is the usual length of the encounters?

Overall time

Examine the year variable to get an overview of the timing of the measurements.

Code
yr_smry <- cwr_tidy |> count(year) 
yr_smry
# A tibble: 9 × 2
   year     n
  <dbl> <int>
1  2017   104
2  2018    98
3  2019   107
4  2020    76
5  2021    96
6  2022   101
7  2023    94
8  2024    89
9    NA    10

We can see that there is a similar number of encounters each year between 2017 and 2024. We also notice that there are some missing values in year; these observations might need removing for later analysis.

Seasonality

Next, extract the month and examine whether there are any seasonal trends.

Code
library(lubridate)
cwr_tidy <- cwr_tidy |>
  mutate(month = month(date, label=TRUE))
cwr_tidy |>
  filter(!is.na(month)) |>
  ggplot(aes(x=month)) +
  geom_bar() + 
  facet_wrap(~year, ncol=4) +
  xlab("")
Figure 1

Figure 1 doesn’t show any strong seasonal patterns. There is some hint of more encounters in Jun-Oct, and fewer in Nov-Dec, but the variation from month to month is large. This suggests that orcas are around this area all year long. Note: if you don’t facet, you will find that the number per month is reasonably constant for Jan-Oct, and lower in Nov-Dec. However, this is something that you should be careful about doing. You should always also examine strata and smaller time scales, which reveal the month-to-month variability in this data.
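The non-faceted view mentioned in the note can be checked directly. This is a minimal sketch that recreates the month column from date so it stands alone; aggregating over all years like this smooths out the year-to-year variability that the faceted view reveals.

```r
library(orcas)
library(tidyverse)
library(lubridate)

# Aggregate monthly counts across all years: reasonably flat Jan-Oct,
# lower in Nov-Dec, but this hides the within-year variability.
cwr_tidy |>
  mutate(month = month(date, label = TRUE)) |>
  filter(!is.na(month)) |>
  ggplot(aes(x = month)) +
  geom_bar() +
  xlab("")
```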

Duration of encounters

Code
library(stringr)
cwr_tidy <- cwr_tidy |>
  # remove the parenthetical "(~x hours)" part of duration, then strip the
  # remaining letters and symbols to leave just the number of seconds
  mutate(duration_num = str_remove_all(duration, "\\([^)]*\\)")) |>
  mutate(duration_num = as.numeric(str_remove_all(duration_num, "[A-Za-z\\-~\\s]")))
ggplot(cwr_tidy, aes(duration_num)) +
  geom_histogram() +
  xlab("Duration (sec)")
Figure 2

From Figure 2 the distribution is bimodal. Most encounters were around 1.5 hours, which is the mode of the first peak. The second, smaller mode peaks around 9-10 hours!
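Reading the histogram in seconds takes some mental arithmetic. One hedged alternative, a sketch that parses the leading seconds from duration directly so it stands alone, is to re-plot the same quantity in hours:

```r
library(orcas)
library(tidyverse)

# Extract the leading number of seconds from strings like
# "5580s (~1.55 hours)" and convert to hours for a more readable axis.
cwr_tidy |>
  mutate(duration_hrs = as.numeric(str_extract(duration, "^[0-9]+")) / 3600) |>
  ggplot(aes(duration_hrs)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Duration (hours)")
```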

Question 2

Summarise the spatial patterns, for example

  • where are these encounters happening?
  • are the boats following the whales based on the tracks of the encounters?

Spatial location

Plotting the longitude and latitude, particularly overlaid on a map, can help work out where the encounters are happening.

Code
library(leaflet)
cwr_tidy |>
  leaflet() |>
  addTiles() |>
  addCircleMarkers(
    radius=1, 
    opacity = 0.5, 
    color = "hotpink", 
    label = ~date,
    lat = ~begin_latitude, lng = ~begin_longitude) 

These encounters are all in the Puget Sound area of the northwest USA and western Canada.

Encounter tracks

Code
library(ggthemes)
cwr_tidy |>
  filter(!is.na(year)) |>
  ggplot() +
    geom_segment(aes(x=begin_longitude, 
                     y=begin_latitude,
                     xend=end_longitude,
                     yend=end_latitude)) +
    geom_point(aes(x=begin_longitude, 
                     y=begin_latitude), shape = 1) +
    facet_wrap(~month, ncol=4) +
    coord_map() +
    theme_map() +
    theme(panel.border = element_rect(colour = "black", fill = NA))
Figure 3: Encounters with segments marking start and end, and circles indicating the start. Most are short. September is when the locations of encounters seem most varied.

Most of the encounters cover a short distance, as shown by the small or non-existent segment lengths in Figure 3.
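The claim that most tracks are short can be quantified. This is a sketch using a hand-rolled haversine to get the great-circle distance in kilometres between the begin and end points (the geosphere package's distHaversine() would be an alternative):

```r
library(orcas)
library(tidyverse)

# Great-circle distance (km) between two lon/lat points, haversine formula.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(pmin(a, 1)))
}

cwr_tidy |>
  mutate(track_km = haversine_km(begin_latitude, begin_longitude,
                                 end_latitude, end_longitude)) |>
  summarise(median_km = median(track_km, na.rm = TRUE),
            max_km = max(track_km, na.rm = TRUE))
```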

Code
library(ggthemes)
cwr_tidy |>
  filter(!is.na(month)) |>
  ggplot() +
    geom_segment(aes(x=begin_longitude, 
                     y=begin_latitude,
                     xend=end_longitude,
                     yend=end_latitude)) +
    geom_point(aes(x=begin_longitude, 
                     y=begin_latitude), shape = 1) +
    facet_wrap(~year, ncol=4) +
    coord_map() +
    theme_map() +
    theme(panel.border = element_rect(colour = "black", fill = NA))
Figure 4: Encounters with segments marking start and end, and circles indicating the start. Most are short. There is little difference between years.

Question 3

Summarise the encounters by the vessels and observers. Are they the same whales that are frequently seen?

  • Are there some especially frequent observers or active vessels?
  • Are the long encounters made by special vessels?
  • Do some vessels make multiple encounters?
Code
vessel_cnt <- cwr_tidy |>
  count(vessel, sort=TRUE) 
vessel_cnt |>
  filter(n > 9)
# A tibble: 5 × 2
  vessel           n
  <chr>        <int>
1 Orcinus        312
2 Mike 1         300
3 KCB III         43
4 Morning Star    33
5 Chimo           22

There are two especially active vessels: Orcinus and Mike 1.

Code
cwr_tidy |> 
  filter(duration_num > 30000) |>
  count(vessel, sort=TRUE) 
# A tibble: 4 × 2
  vessel                                               n
  <chr>                                            <int>
1 Mike 1                                              45
2 Orcinus                                             33
3 KCB III                                             13
4 KCB IIIRachel John, Arlene Vargas, Charli Grimes     1

The same two vessels are also the ones that made the long encounters. There is some error in the data, in that some vessel names and observer names are mashed together.

Code
cwr_tidy |> 
  count(observers, sort=TRUE) 
# A tibble: 251 × 2
   observers                                n
   <chr>                                <int>
 1 Dave Ellifrit                           96
 2 Mark Malleson                           76
 3 Mark Malleson, Joe Zelwietro            40
 4 Mark Malleson,Joe Zelwietro             24
 5 Dave Ellifrit, Katie Jones              22
 6 Ken Balcomb                             22
 7 Melisa Pinnow, Jane Cogan, Tom Cogan    22
 8 Mark Malleson,Brendon Bissonnette       18
 9 Dave Ellifrit, Michael Weiss            12
10 Mark Malleson, Hanna Magnusson          12
# ℹ 241 more rows

The observer names probably need to be split and recounted, since the same people appear alone and in different combinations, and some entries differ only in spacing. Are these observers always associated with the same boats?
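One way to split the multi-person entries before recounting is a sketch with tidyr's separate_longer_delim(); it assumes that both "," and ", " occur as separators in observers, so the whitespace is trimmed afterwards.

```r
library(orcas)
library(tidyverse)

# One row per person: split on commas, trim stray spaces, recount.
cwr_tidy |>
  filter(!is.na(observers)) |>
  separate_longer_delim(observers, delim = ",") |>
  mutate(observers = str_trim(observers)) |>
  count(observers, sort = TRUE)
```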

Code
cwr_tidy |> 
  count(pods_or_ecotype, sort=TRUE) 
# A tibble: 100 × 2
   pods_or_ecotype          n
   <chr>                <int>
 1 Transients             339
 2 J                       72
 3 Bigg's killer whales    61
 4 J, K, L                 30
 5 J Pod                   25
 6 L                       21
 7 J pod                   20
 8 Bigg's Transients       12
 9 J pod and L87           12
10 J, L                    12
# ℹ 90 more rows

This information is recorded inconsistently: for example, "J", "J Pod" and "J pod" all refer to the same pod, and "Transients" and "Bigg's killer whales" are two names for the same ecotype. It needs more processing before counting.
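A sketch of that processing is below. The synonym mapping is an assumption based on the counts above (collapsing the Bigg's/Transient labels into one, and dropping the word "pod"); the variable descriptions should be checked before relying on it.

```r
library(orcas)
library(tidyverse)

# Collapse obvious synonyms in pods_or_ecotype before recounting.
cwr_tidy |>
  filter(!is.na(pods_or_ecotype)) |>
  mutate(
    pods_clean = str_remove_all(pods_or_ecotype,
                                regex(" ?pod", ignore_case = TRUE)),
    pods_clean = if_else(
      str_detect(pods_clean, regex("transient|bigg", ignore_case = TRUE)),
      "Transients",
      str_trim(pods_clean)
    )
  ) |>
  count(pods_clean, sort = TRUE)
```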

Question 4

Each encounter has a text description. Summarise the common words used for the encounters.

Code
library(tidytext)
library(ggwordcloud)
cwr_text <- cwr_tidy |>
  unnest_tokens(word, encounter_summary) |>
  select(word) |>
  anti_join(stop_words)

cwr_text_top <- cwr_text |> 
  filter(!is.na(word)) |>
  count(word, sort=TRUE) |>
  slice_head(n=100) 
cwr_text_top |>
  ggplot() +
  geom_text_wordcloud_area(aes(label = word, size=n)) +
  scale_size_area(max_size = 30) + 
  theme_minimal()
Figure 5: Word cloud of top 100 words.

The text descriptions are not especially interesting word-wise (Figure 5), because the top words are whales, south, island, north, encounter, west, mark, east, headed, heading. A few of the top 100 words are interesting: "dave", "mike" and "joe" feature. The words "calf", "seal" and "fish" suggest the nature of the encounters is more interesting, and "milling" and "foraging" suggest activities of the whales. More processing of the text could be done to remove uninteresting words, and also to look for two or three words used together, n-grams.
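The n-gram idea can be sketched with tidytext's ngram tokeniser; word1 and word2 are just intermediate column names introduced here for the stop-word filtering.

```r
library(orcas)
library(tidyverse)
library(tidytext)

# Tokenise the summaries into two-word phrases, drop pairs containing a
# stop word, and count the most common remaining bigrams.
cwr_tidy |>
  unnest_tokens(bigram, encounter_summary, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) |>
  count(word1, word2, sort = TRUE)
```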

Resources

In this part, you should cite major resources used, including R packages, and actively discuss how generative AI helped with your answers to the assignment questions, and where or how it was mistaken or misleading.

You need to provide links to the full transcripts of your conversations with generative AI tools. You should not use a paid service, as the freely available systems will be sufficiently helpful.

For example, the citation() function in R can give R package details:

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

The links to my use of ChatGPT for help on this assignment are:

Rubric

To help you complete your report, below is a rubric to guide you on what we are expecting:

content Excellent (HD) Very good (D) Good (C) Satisfactory (P) Unsatisfactory (F)
Q1-4 Plots and summaries are comprehensive and concise. Summaries are designed well and polished. Text is used to summarise the tables and plots. Anything interesting or problematic with the data is discussed. Plots and summaries are complete and concise. Summaries are appropriate. Text is added to summarise the tables and plots. Interesting or problematic aspects of the data are reported. Plots and summaries mostly answer questions raised by the initial expectations, and provide good answers and reasoning. Plots and summaries are not complete, or there are too many summaries provided. Summaries are mostly appropriate. Text is added to summarise the tables and plots. Plots and tables are not well matched to what is needed, and text explanations are not provided. Plots are not readable.
Repo Actively committing during the entire assignment period, with informative commit messages, made after each small change to the work. Committing changes during most of the assignment period, with a clear evolution of the data analysis. (3) Repo not accepted in time, and fewer than 11 commits (2) Repo not accepted in time, and fewer than 5 commits (1) Repo not accepted in time, and a single commit (0)
GAI GAI used effectively, deeply, and explained, script linked to report (0 deduction) GAI used but not explained well or inadequate and script linked to report (-0.5) Shallow use of GAI, script linked to report (-1) Clearly used but no script linked to report (-3)
Reproducibility No changes needed for report to reproduce exactly as provided. Code is nicely formatted, commented and readable using appropriate tidyverse standards. (0 deduction) Single change needed for report to reproduce exactly as provided. Code is nicely formatted, commented and readable using appropriate tidyverse standards. (0 deduction) Just a few changes needed for report to reproduce exactly as provided. Code is formatted, commented and readable using appropriate tidyverse standards. (-0.5) Multiple changes needed for report to reproduce exactly as provided. Code is readable. (-1) Cannot easily make changes for report to reproduce at all. Code is not readable. (-2)
Spelling/Grammar Writing style is exceptional, scholarly and succinct, free from spelling, grammar and punctuation errors. (0 deduction) Writing style is scholarly, free from spelling, grammar and punctuation errors. (0 deduction) Writing style is scholarly, but wordy and not concise. Free from spelling, grammar and punctuation errors. (-0.5) Writing is scholarly but wordy. Contains some grammatical, punctuation and spelling errors. (-1) Writing is unscholarly. Many grammatical, punctuation and spelling errors. (-2)
References The appropriate referencing style has been used consistently, with no errors. Includes citations for software used, and data sources. (0 deduction) The appropriate referencing style has been used consistently, with very few errors, and includes software used, and data sources. (0 deduction) The appropriate referencing style has been used consistently, and only a few citations missing. (-0.5) The appropriate referencing style has been used much of the time, missing some major sources that were clearly used. (-1) Material used from external sources without citation. (-2)